This dataset contains information related to music tracks from a rebuilt version and subset of The Million Song Dataset. It was built from lastfm-spotify-tags-sim-userdata, The Echo Nest Taste Profile Subset & lastfm-dataset-2020, tagtraum genre annotations, and the Spotify API. It consists of the following columns:
Peculiarities:
This dataset contains missing values in the 'tags' and 'genre' columns: of 50,683 tracks, 49,556 have tags and only 22,348 have a genre. Additionally, the 'genre' labels are not always consistent with the tags (e.g., a track tagged 'rock, alternative, alternative_rock, 90s, grunge' carries the genre 'RnB'). The dataset also includes various audio features (e.g., danceability, energy, loudness) represented by numerical values.
Process of Analysis:
1. Data Loading: The first step in the analysis process involved loading the 'musicInfo.csv' file into a DataFrame using the pandas library.
2. Data Cleaning: The dataset was inspected for missing values and inconsistencies in the 'genre' column. Missing values in the 'tags' column may be handled based on the analysis requirements.
3. Data Exploration: Basic exploratory data analysis (EDA) techniques were applied to understand the structure and data types of the dataset. Summary statistics and visualizations might be used to gain insights into the distribution of track features.
4. Data Analysis: The dataset may be analyzed to identify trends in track features, such as danceability, energy, or acousticness, across different genres or years.
5. Data Visualization: Visualizations like histograms, box plots, or scatter plots could be utilized to represent the distribution and relationships between various audio features and other attributes of the tracks.
6. Insights and Conclusions:
Analysis of the audio features and genres can provide insights into music preferences, genre popularity, and how specific musical characteristics impact the tracks' reception. Understanding the relationships between these features can help music platforms and artists tailor their offerings to cater to user demands and improve user engagement.
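The inspection for missing values and inconsistent genre labels described above can be sketched as follows. This is a minimal illustration on a toy DataFrame, not the notebook's actual cleaning code: `missing_summary`, `toy_tracks`, and the `genre_norm` column are hypothetical names introduced here, and the rows are made up.

```python
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, worst first."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
    })
    return summary.sort_values("missing", ascending=False)

# Toy stand-in for musicInfo.csv (hypothetical rows, not real data)
toy_tracks = pd.DataFrame({
    "track_id": ["T1", "T2", "T3", "T4"],
    "tags": ["rock, indie", None, "pop", "rap"],
    "genre": ["Rock", None, None, "RnB"],
})

summary = missing_summary(toy_tracks)
print(summary)

# Normalise label casing once so that 'Rock' and 'rock' compare equal later
toy_tracks["genre_norm"] = toy_tracks["genre"].str.strip().str.lower()
```

Reporting the percentage alongside the raw count makes it easier to judge whether dropping rows is acceptable or whether it would discard most of the data.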
This dataset contains information related to music tracks and user interactions from a rebuilt version and subset of The Million Song Dataset. It was built from lastfm-spotify-tags-sim-userdata, The Echo Nest Taste Profile Subset & lastfm-dataset-2020, tagtraum genre annotations, and the Spotify API. It consists of the following columns:
Peculiarities:
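One peculiarity of this dataset is that the listening-history file itself has only three columns ('track_id', 'user_id', 'playcount'), so the missing values (NaN) in 'genre' appear only after it is merged with the track metadata. Additionally, the dataset may contain duplicate records for some music tracks or users, which could impact the analysis.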
Process of Analysis:
1. Data Loading: The first step in the analysis process involved loading the 'userListeningHistory.csv' file into a DataFrame using the pandas library.
2. Data Cleaning: The dataset was inspected for missing values, duplicate records, and unnecessary columns. The 'genre' column with missing values may be dropped or filled based on the analysis requirements.
3. Data Exploration: Basic exploratory data analysis (EDA) techniques were applied to gain insights into the dataset's structure, data types, and memory usage. Summary statistics and visualizations were used to understand the distribution of playcounts and identify potential patterns.
4. Handling Missing Values: If the 'genre' column was required for the analysis, the missing values could be filled using appropriate methods like forward fill, backward fill, or mode imputation.
5. Handling Duplicates: If duplicate records were identified, appropriate actions could be taken, such as dropping duplicates or aggregating the data based on specific criteria.
6. Data Visualization: Further data visualization techniques like histograms, box plots, or scatter plots might be used to visualize the distribution and relationships of the data.
7. Analysis Insights: The analysis aimed to uncover insights into music preferences, track popularity, and genre trends. Understanding the impact of music attributes on playcounts and popularity could help optimize music content for a larger audience.
Conclusion:
This dataset provides valuable information about music track playcounts and user interactions. Analyzing this dataset can reveal patterns and trends in music preferences, helping music platforms tailor their offerings to meet user demands and improve user engagement. It is crucial to handle missing values and duplicates appropriately to ensure accurate and meaningful analysis results.
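The duplicate handling and mode imputation mentioned in the steps above might look like the sketch below. It runs on toy data; `toy_history`, `toy_genre`, and the other variable names are hypothetical. Note that forward or backward fill assumes the row order carries meaning, and mode imputation inflates the most common genre, so either choice should be treated as an assumption of the analysis rather than a neutral default.

```python
import pandas as pd

# Toy stand-in for userListeningHistory.csv (hypothetical rows)
toy_history = pd.DataFrame({
    "track_id": ["T1", "T1", "T2", "T2", "T3"],
    "user_id": ["u1", "u1", "u1", "u2", "u2"],
    "playcount": [3, 3, 1, 5, 2],
})

# Option 1: drop rows that are exact duplicates of an earlier row
deduped = toy_history.drop_duplicates()

# Option 2: aggregate repeated (track_id, user_id) pairs by summing playcounts
per_pair = deduped.groupby(["track_id", "user_id"], as_index=False)["playcount"].sum()

# Mode imputation for a categorical column with missing values
toy_genre = pd.Series(["Rock", None, "Rock", "Pop", None], name="genre")
filled = toy_genre.fillna(toy_genre.mode().iloc[0])
```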
This dataset contains information related to music tracks and user interactions from MusicBrainz and Last.fm. It consists of the following columns:
Peculiarities:
One peculiarity of this dataset is that it contains missing values (NaN) in most columns, notably 'artist_lastfm', 'country_lastfm', 'tags_mb', and 'tags_lastfm'; only 'mbid' and 'ambiguous_artist' are complete. Additionally, it includes a boolean column 'ambiguous_artist' that flags whether an artist's name is ambiguous.
Process of Analysis:
1. Data Loading: The first step in the analysis process involved loading the 'artists.csv' file into a DataFrame using the pandas library.
2. Data Cleaning: The dataset was inspected for missing values and unnecessary columns. Depending on the analysis requirements, missing values in relevant columns could be handled by either dropping the rows or filling the missing values with appropriate methods.
3. Data Exploration: Basic exploratory data analysis (EDA) techniques were applied to understand the structure and data types of the dataset. Summary statistics and visualizations might be used to gain insights into the distribution of playcounts and listener counts.
4. Data Visualization: Further data visualization techniques like histograms, box plots, or scatter plots might be used to visualize the distribution and relationships of the data, such as comparing playcounts with listeners, or analyzing the popularity of different tags.
5. Analysis Insights: The analysis aims to uncover insights into music preferences, track popularity, and the impact of various factors on playcounts and scrobbles. Understanding the relationship between artists, countries, and music tags can help optimize music content and improve user engagement on the platform.
Conclusion:
This dataset provides valuable information about music tracks, artists, and user interactions on a music platform. Analyzing this dataset can reveal patterns and trends in music preferences, helping to optimize music content and improve the overall user experience. Proper handling of missing values and consideration of the 'ambiguous_artist' column will ensure accurate and meaningful analysis results.
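One way to handle the missing values and the 'ambiguous_artist' flag without discarding every incomplete row is sketched below. The toy rows and the names `toy_artists`, `required`, and `clean` are hypothetical; which columns count as required depends on the question being asked, so the choice here is only an assumption.

```python
import numpy as np
import pandas as pd

# Toy stand-in for artists.csv (hypothetical rows)
toy_artists = pd.DataFrame({
    "artist_mb": ["A", "B", "C", "D"],
    "country_mb": ["United Kingdom", None, "United States", "Sweden"],
    "tags_lastfm": ["rock; pop", "rap", None, "metal"],
    "listeners_lastfm": [100.0, 50.0, np.nan, 10.0],
    "scrobbles_lastfm": [1000.0, 200.0, np.nan, 30.0],
    "ambiguous_artist": [False, False, False, True],
})

# Drop rows only where the columns needed for this analysis are missing
required = ["listeners_lastfm", "scrobbles_lastfm"]
clean = toy_artists.dropna(subset=required)

# Exclude artists whose name is flagged as ambiguous
clean = clean[~clean["ambiguous_artist"]]
```

Compared with a bare `dropna()`, restricting the check to a subset keeps artists that merely lack a country or tags, which matters when most columns are sparsely populated.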
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Read musicInfo.csv
musicInfo = pd.read_csv("musicInfo.csv")
print(musicInfo.info())
print(musicInfo.head(10))
# Drop rows with missing values in musicInfo DataFrame
musicInfo.dropna(inplace=True)
# Read userListeningHistory.csv
listeningHistory = pd.read_csv("userListeningHistory.csv")
print(listeningHistory.info())
# Drop rows with missing values in listeningHistory DataFrame
listeningHistory.dropna(inplace=True)
# Read artists.csv
artists = pd.read_csv("artists.csv", low_memory=False)
print(artists.info())
print(artists.head(10))
# Drop rows with missing values in artists DataFrame
artists.dropna(inplace=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50683 entries, 0 to 50682
Data columns (total 21 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 track_id 50683 non-null object
1 name 50683 non-null object
2 artist 50683 non-null object
3 spotify_preview_url 50683 non-null object
4 spotify_id 50683 non-null object
5 tags 49556 non-null object
6 genre 22348 non-null object
7 year 50683 non-null int64
8 duration_ms 50683 non-null int64
9 danceability 50683 non-null float64
10 energy 50683 non-null float64
11 key 50683 non-null int64
12 loudness 50683 non-null float64
13 mode 50683 non-null int64
14 speechiness 50683 non-null float64
15 acousticness 50683 non-null float64
16 instrumentalness 50683 non-null float64
17 liveness 50683 non-null float64
18 valence 50683 non-null float64
19 tempo 50683 non-null float64
20 time_signature 50683 non-null int64
dtypes: float64(9), int64(5), object(7)
memory usage: 8.1+ MB
None
track_id name artist \
0 TRIOREW128F424EAF0 Mr. Brightside The Killers
1 TRRIVDJ128F429B0E8 Wonderwall Oasis
2 TROUVHL128F426C441 Come as You Are Nirvana
3 TRUEIND128F93038C4 Take Me Out Franz Ferdinand
4 TRLNZBD128F935E4D8 Creep Radiohead
5 TRUMISQ128F9340BEE Somebody Told Me The Killers
6 TRVCCWR128F9304A30 Viva la Vida Coldplay
7 TRXOGZT128F424AD74 Karma Police Radiohead
8 TRMZXEW128F9341FD5 The Scientist Coldplay
9 TRUJIIV12903CA8848 Clocks Coldplay
spotify_preview_url spotify_id \
0 https://p.scdn.co/mp3-preview/4d26180e6961fd46... 09ZQ5TmUG8TSL56n0knqrj
1 https://p.scdn.co/mp3-preview/d012e536916c927b... 06UfBBDISthj1ZJAtX4xjj
2 https://p.scdn.co/mp3-preview/a1c11bb1cb231031... 0keNu0t0tqsWtExGM3nT1D
3 https://p.scdn.co/mp3-preview/399c401370438be4... 0ancVQ9wEcHVd0RrGICTE4
4 https://p.scdn.co/mp3-preview/e7eb60e9466bc3a2... 01QoK9DA7VTeTSE3MNzp4I
5 https://p.scdn.co/mp3-preview/0d07673cfb46218a... 0FNmIQ7u45Lhdn6RHhSLix
6 https://p.scdn.co/mp3-preview/ab747fed1bfab2ac... 08A1lZeyLMWH58DT6aYjnC
7 https://p.scdn.co/mp3-preview/5a09f5390e2862af... 01puceOqImrzSfKDAcd1Ia
8 https://p.scdn.co/mp3-preview/95cb9df1b056d759... 0GSSsT9szp0rJkBrYkzy6s
9 https://p.scdn.co/mp3-preview/24c7fe858b234e3c... 0BCPKOYdS2jbQ8iyB56Zns
tags genre year duration_ms \
0 rock, alternative, indie, alternative_rock, in... NaN 2004 222200
1 rock, alternative, indie, pop, alternative_roc... NaN 2006 258613
2 rock, alternative, alternative_rock, 90s, grunge RnB 1991 218920
3 rock, alternative, indie, alternative_rock, in... NaN 2004 237026
4 rock, alternative, indie, alternative_rock, in... RnB 2008 238640
5 rock, alternative, indie, pop, alternative_roc... NaN 2005 198480
6 rock, alternative, indie, pop, alternative_roc... NaN 2013 235384
7 rock, alternative, indie, alternative_rock, in... NaN 1996 264066
8 rock, alternative, indie, pop, alternative_roc... Rock 2007 311014
9 rock, alternative, indie, pop, alternative_roc... NaN 2002 307879
danceability ... key loudness mode speechiness acousticness \
0 0.355 ... 1 -4.360 1 0.0746 0.001190
1 0.409 ... 2 -4.373 1 0.0336 0.000807
2 0.508 ... 4 -5.783 0 0.0400 0.000175
3 0.279 ... 9 -8.851 1 0.0371 0.000389
4 0.515 ... 7 -9.935 1 0.0369 0.010200
5 0.508 ... 10 -4.289 0 0.0847 0.000087
6 0.588 ... 8 -7.903 1 0.1050 0.153000
7 0.360 ... 7 -9.129 1 0.0260 0.062600
8 0.566 ... 5 -7.826 1 0.0242 0.715000
9 0.577 ... 5 -7.215 0 0.0279 0.599000
instrumentalness liveness valence tempo time_signature
0 0.000000 0.0971 0.240 148.114 4
1 0.000000 0.2070 0.651 174.426 4
2 0.000459 0.0878 0.543 120.012 4
3 0.000655 0.1330 0.490 104.560 4
4 0.000141 0.1290 0.104 91.841 4
5 0.000643 0.0641 0.704 138.030 4
6 0.000000 0.0634 0.520 137.973 4
7 0.000092 0.1720 0.317 74.807 4
8 0.000014 0.1200 0.173 146.365 4
9 0.011500 0.1830 0.255 130.970 4
[10 rows x 21 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9711301 entries, 0 to 9711300
Data columns (total 3 columns):
# Column Dtype
--- ------ -----
0 track_id object
1 user_id object
2 playcount int64
dtypes: int64(1), object(2)
memory usage: 222.3+ MB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1466083 entries, 0 to 1466082
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 mbid 1466083 non-null object
1 artist_mb 1466075 non-null object
2 artist_lastfm 986756 non-null object
3 country_mb 662368 non-null object
4 country_lastfm 211498 non-null object
5 tags_mb 119946 non-null object
6 tags_lastfm 381075 non-null object
7 listeners_lastfm 986760 non-null float64
8 scrobbles_lastfm 986760 non-null float64
9 ambiguous_artist 1466083 non-null bool
dtypes: bool(1), float64(2), object(7)
memory usage: 102.1+ MB
None
mbid artist_mb \
0 cc197bad-dc9c-440d-a5b5-d52ba2e14234 Coldplay
1 a74b1b7f-71a5-4011-9441-d0b5e4122711 Radiohead
2 8bfac288-ccc5-448d-9573-c33ea2aa5c30 Red Hot Chili Peppers
3 73e5e69d-3554-40d8-8516-00cb38737a1c Rihanna
4 b95ce3ff-3d05-4e87-9e01-c97b66af13d4 Eminem
5 95e1ead9-4d31-4808-a7ac-32c3614c116b The Killers
6 164f0d73-1234-4e2c-8743-d77bf2191051 Kanye West
7 5b11f4ce-a62d-471e-81fc-a69a8278c7da Nirvana
8 9c9f1380-2516-4fc9-a3e6-f9f61941d090 Muse
9 0383dadf-2a4e-4d10-a46a-e9e041da8eb3 Queen
artist_lastfm country_mb country_lastfm \
0 Coldplay United Kingdom United Kingdom
1 Radiohead United Kingdom United Kingdom
2 Red Hot Chili Peppers United States United States
3 Rihanna United States Barbados; United States
4 Eminem United States United States
5 The Killers United States NaN
6 Kanye West United States United States
7 Nirvana United States United States
8 Muse United Kingdom United Kingdom
9 Queen United Kingdom United Kingdom
tags_mb \
0 rock; pop; alternative rock; british; uk; brit...
1 rock; electronic; alternative rock; british; g...
2 rock; alternative rock; 80s; 90s; rap; metal; ...
3 pop; dance; hip hop; reggae; contemporary r b;...
4 turkish; rap; american; hip-hop; hip hop; hiph...
5 synthpop; alternative rock; american; new wave...
6 synthpop; pop; american; hip-hop; hip hop; ele...
7 rock; alternative rock; 90s; punk; american; e...
8 rock; electronic; synthpop; alternative rock; ...
9 rock; progressive rock; 70s; 80s; 90s; pop-roc...
tags_lastfm listeners_lastfm \
0 rock; alternative; britpop; alternative rock; ... 5381567.0
1 alternative; alternative rock; rock; indie; el... 4732528.0
2 rock; alternative rock; alternative; Funk Rock... 4620835.0
3 pop; rnb; female vocalists; dance; Hip-Hop; Ri... 4558193.0
4 rap; Hip-Hop; Eminem; hip hop; pop; american; ... 4517997.0
5 indie; rock; indie rock; alternative; alternat... 4428868.0
6 Hip-Hop; rap; hip hop; rnb; Kanye West; seen l... 4390502.0
7 Grunge; rock; alternative; alternative rock; 9... 4272894.0
8 alternative rock; rock; alternative; Progressi... 4089612.0
9 classic rock; rock; 80s; hard rock; glam rock;... 4023379.0
scrobbles_lastfm ambiguous_artist
0 360111850.0 False
1 499548797.0 False
2 293784041.0 False
3 199248986.0 False
4 199507511.0 False
5 208722092.0 False
6 238603850.0 False
7 222303859.0 False
8 344838631.0 False
9 191711573.0 False
track_id_playcount = listeningHistory.groupby('track_id').playcount.agg(['count', 'sum'])
complete_info = track_id_playcount.merge(musicInfo, on='track_id')
# Group by 'year' and sum 'playcount'
yearly_playcount = complete_info.groupby('year').sum(numeric_only=True)['sum']
# Creating a DataFrame with all years and initializing with 0
all_years = pd.DataFrame({'year': np.arange(1900, 2021), 'playcount': 0})
# Set 'year' as index in both DataFrames for the merge operation
all_years.set_index('year', inplace=True)
yearly_playcount = yearly_playcount.to_frame().rename(columns={'sum': 'playcount'})
# Merge dataframes
final_counts = all_years.merge(yearly_playcount, left_index=True, right_index=True, how='left')
# Fill NaN values with 0
final_counts.fillna(0, inplace=True)
# Sum the playcounts if there are multiple columns
final_counts['playcount'] = final_counts.sum(axis=1)
final_counts = final_counts.reset_index()
# Filter the data from 1950 to 1980
filtered_data_1 = final_counts[(final_counts['year'] >= 1950) & (final_counts['year'] <= 1980)]
# Create a subplot for the first time period (1950 to 1980)
plt.figure(figsize=(18, 12))
plt.subplot(2, 1, 1)
plt.plot(filtered_data_1['year'], filtered_data_1['playcount'], marker="o", linestyle="-")
plt.xlabel('Year')
plt.ylabel('Playcount')
plt.title('Year vs Playcount (1950 to 1980)')
plt.xticks(range(1950, 1981), rotation=90)
plt.grid(True)
# Filter the data from 1980 to 2020
filtered_data_2 = final_counts[(final_counts['year'] >= 1980) & (final_counts['year'] <= 2020)]
# Create a subplot for the second time period (1980 to 2020)
plt.subplot(2, 1, 2)
plt.plot(filtered_data_2['year'], filtered_data_2['playcount'], marker="o", linestyle="-")
plt.xlabel('Year')
plt.ylabel('Playcount')
plt.title('Year vs Playcount (1980 to 2020)')
plt.xticks(range(1980, 2021), rotation=90)
plt.grid(True)
plt.tight_layout()
plt.show()
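The zero-filled yearly series built above with a helper DataFrame and a merge can also be expressed with `reindex`. The sketch below shows the idea on toy data; `toy` and `yearly` are hypothetical names and the numbers are made up.

```python
import pandas as pd

# Toy play counts with gaps in the year sequence (hypothetical values)
toy = pd.DataFrame({
    "year": [1999, 2001, 2001, 2003],
    "playcount": [5, 2, 3, 7],
})

# Sum per year, then insert the missing years with a playcount of 0
yearly = (
    toy.groupby("year")["playcount"]
       .sum()
       .reindex(range(1998, 2005), fill_value=0)
)
```

The result is a Series indexed by year, so it can be passed to `plt.plot(yearly.index, yearly.values)` directly.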
import matplotlib.pyplot as plt
merged_df = listeningHistory.merge(musicInfo, on='track_id')
filtered_post_war = merged_df[(merged_df['year'] >= 1950) & (merged_df['year'] <= 1980)]
filtered_digital = merged_df[(merged_df['year'] >= 1981) & (merged_df['year'] <= 2020)]
grouped_post_war = filtered_post_war.groupby('genre')['playcount'].agg('sum')
grouped_digital = filtered_digital.groupby('genre')['playcount'].agg('sum')
fig1, ax1 = plt.subplots()
ax1.bar(grouped_post_war.index, grouped_post_war.values)
ax1.set_xlabel('Genre')
ax1.set_ylabel('Total Playcount')
ax1.set_title('Total Playcount per Genre from 1950 to 1980 (Post-War Era)')
plt.xticks(rotation=90)
plt.show()
fig2, ax2 = plt.subplots()
ax2.bar(grouped_digital.index, grouped_digital.values)
ax2.set_xlabel('Genre')
ax2.set_ylabel('Total Playcount')
ax2.set_title('Total Playcount per Genre from 1981 to 2020 (Digital Age)')
plt.xticks(rotation=90)
plt.show()
import matplotlib.pyplot as plt
merged_df = listeningHistory.merge(musicInfo, on='track_id')
merged_df['genre'] = merged_df['genre'].str.lower()
filtered_df = merged_df[(merged_df['year'] >= 2000) & (merged_df['year'] <= 2015)]
genres = ['country', 'electronic', 'pop', 'rap', 'metal', 'rock']
filtered_df = filtered_df[filtered_df['genre'].isin(genres)]
grouped = filtered_df.groupby(['year', 'genre'])['playcount'].agg('sum').reset_index()
fig1, ax1 = plt.subplots()
for genre in genres:
    if genre == 'rock':
        continue
    data = grouped[grouped['genre'] == genre]
    ax1.plot(data['year'], data['playcount'], label=genre, marker="o", linestyle="-")
ax1.set_xlabel('Year')
ax1.set_ylabel('Total Playcount')
ax1.set_title('Total Playcount per Genre from 2000 to 2015 (Excluding Rock)')
ax1.legend(loc='upper left', bbox_to_anchor=(1.05, 1))
ax1.grid(True)
plt.show()
fig2, ax2 = plt.subplots()
data = grouped[grouped['genre'] == 'rock']
ax2.plot(data['year'], data['playcount'], label='rock', marker="o", linestyle="-")
ax2.set_xlabel('Year')
ax2.set_ylabel('Total Playcount')
ax2.set_title('Total Playcount for Rock from 2000 to 2015')
ax2.legend(loc='upper left', bbox_to_anchor=(1.05, 1))
ax2.grid(True)
plt.show()
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Shuffle all rows: frac=1 keeps 100% of the data (use frac=0.1 for a 10% sample)
sample_artists = artists.sample(frac=1)
# Convert the genres to a set for fast lookup
genre_set = set(musicInfo['genre'].str.lower())
tags_series = sample_artists['tags_lastfm'].str.split(';')
flattened_tags = tags_series.explode().str.strip()
# Get the unique tags that are also in genre_set (kept as a set for fast membership tests)
matching_tags = set(flattened_tags[flattened_tags.isin(genre_set)].unique())
# Strip each tag before comparing: splitting 'rock; pop' on ';' leaves ' pop' with a leading space
selected_artists = sample_artists[sample_artists['tags_lastfm'].apply(lambda tags: any(tag.strip() in matching_tags for tag in str(tags).split(';')))].copy()
# Create a new column 'matching_tags' in selected_artists DataFrame that contains the list of matching tags for each row
selected_artists.loc[:, 'matching_tags'] = selected_artists['tags_lastfm'].apply(lambda tags: [tag.strip() for tag in str(tags).split(';') if tag.strip() in matching_tags])
# Explode 'matching_tags' so that each row represents one tag and its corresponding scrobblescore
exploded_artists = selected_artists.explode('matching_tags')
def normalize(series):
    return (series - series.min()) / (series.max() - series.min())
# Apply logarithmic transformation
exploded_artists['log_scrobbles'] = np.log(exploded_artists['scrobbles_lastfm'] + 1) # Adding 1 to handle 0 values
# Apply normalization to log-transformed scrobbles
exploded_artists['normalized_log_scrobbles'] = normalize(exploded_artists['log_scrobbles'])
plt.figure(figsize=(12, 8))
tags = exploded_artists['matching_tags'].unique()
# Create a dictionary to hold normalized scrobbles for each tag
normalized_scrobbles_dict = {tag: exploded_artists.loc[exploded_artists['matching_tags']==tag, 'normalized_log_scrobbles'].values for tag in tags}
# Create a list of lists for boxplot data
boxplot_data = [normalized_scrobbles_dict[tag] for tag in tags]
plt.boxplot(boxplot_data, vert=True)
plt.xticks(range(1, len(tags)+1), tags, rotation=90)
plt.xlabel('Matching Tags ')
plt.ylabel('Normalized Log Scrobblescore')
plt.title('Normalized Log Scrobblescore Distribution for Each Tag')
plt.tight_layout()
plt.show()
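Matching semicolon-separated tags against a set of genres depends on stripping whitespace (and, if desired, lowercasing) after the split. The sketch below illustrates this on toy data; `toy`, `tags`, and `matches` are hypothetical names, and lowercasing the tags is an assumption that the notebook code above does not make.

```python
import pandas as pd

# Toy stand-in for the artists table (hypothetical rows)
toy = pd.DataFrame({
    "artist": ["A", "B"],
    "tags_lastfm": ["rock; alternative; pop", "Hip-Hop; rap"],
})
genre_set = {"rock", "pop", "rap"}

# One row per (artist, tag); split leaves leading spaces that must be stripped
tags = toy.assign(tag=toy["tags_lastfm"].str.split(";")).explode("tag")
tags["tag"] = tags["tag"].str.strip().str.lower()

# Keep only tags that correspond to a known genre
matches = tags[tags["tag"].isin(genre_set)]
```

Without the `str.strip()` step only the first tag of each artist could match, because every later tag would start with a space.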
complete_info = track_id_playcount.merge(musicInfo, on='track_id')
complete_info = complete_info[(complete_info['year'] >= 2000) & (complete_info['year'] <= 2015)]
# Group by 'genre' and count unique 'track_id'
songs_per_genre = complete_info.groupby('genre')['track_id'].nunique()
# Convert to DataFrame and add a 'percentage' column
songs_per_genre = songs_per_genre.to_frame()
songs_per_genre.columns = ['count']
songs_per_genre['percentage'] = (songs_per_genre['count'] / songs_per_genre['count'].sum()) * 100
neglected_text = "Neglected Portions:\n" + "\n"
# Create labels for the pie chart
def create_label(row):
    global neglected_text
    # Slices under 2% are listed in the side text instead of being labelled on the pie
    if row['percentage'] < 2:
        neglected_text += f"{row.name} - {int(row['count'])} ({row['percentage']:.1f}%)\n"
        return ''
    else:
        return f"{row.name} - {int(row['count'])} ({row['percentage']:.1f}%)"
songs_per_genre['label'] = songs_per_genre.apply(create_label, axis=1)
# Plot
fig, ax = plt.subplots(figsize=(10, 8))
wedges, texts, autotexts = ax.pie(songs_per_genre['count'], labels=songs_per_genre['label'], autopct='%1.1f%%')
ax.set_title("Number of songs per genre between 2000 and 2015")
plt.text(1.2, 0.5, neglected_text, transform=ax.transAxes, fontsize=10)
# Hide labels and autotexts for neglected portions
for text, autotext in zip(texts, autotexts):
    if text.get_text() == "":
        text.set_visible(False)
        autotext.set_visible(False)
plt.show()
# Merge the datasets
complete_info = musicInfo.merge(listeningHistory, on='track_id')
# Filter data for the specified genres
filtered_data = complete_info[(complete_info['genre'].isin(['Metal', 'Pop', 'Electronic', 'Rock', 'RnB', 'Rap'])) &
(complete_info['year'] >= 1980) &
(complete_info['year'] <= 2015)]
# Create subplots
fig, axs = plt.subplots(3, 2, figsize=(14, 15)) # Adjusted size and layout for 6 plots
# Create a list of genres and axs indices
genres = ['Metal', 'Pop', 'Electronic', 'Rock', 'RnB', 'Rap'] # Added 'Rap'
axs_indices = [(0,0), (0,1), (1,0), (1,1), (2,0), (2,1)] # Added indices for the additional plots
# For each genre, filter data and plot histogram
for genre, ax_index in zip(genres, axs_indices):
    genre_data = filtered_data[filtered_data['genre'] == genre]
    axs[ax_index].hist(genre_data['year'], bins=16, color='blue', edgecolor='black', alpha=0.7)
    # Title matches the 1980-2015 filter applied above
    axs[ax_index].set_title(f'Yearly Distribution for {genre} (1980-2015)')
    axs[ax_index].set_xlabel('Year')
    axs[ax_index].set_ylabel('Number of Songs')
# Adjust the spacing between subplots
plt.tight_layout()
plt.show()
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# Define a function to normalize a pandas Series
def normalize(series):
    return (series - series.min()) / (series.max() - series.min())
# Convert the genres to a set for fast lookup
genre_set = set(musicInfo['genre'].str.lower())
# Create a DataFrame that maps each artist to a genre
artists_genres = pd.DataFrame([(row['artist_lastfm'], tag.strip()) for _, row in artists.iterrows() for tag in str(row['tags_lastfm']).split(';') if tag.strip() in genre_set], columns=['artist', 'genre'])
# Merge this DataFrame with artists on 'artist'
merged_artists = pd.merge(artists, artists_genres, left_on='artist_lastfm', right_on='artist')
# Apply logarithmic transformation and then normalize 'listeners_lastfm' and 'scrobbles_lastfm' columns
merged_artists['listeners_lastfm'] = normalize(np.log1p(merged_artists['listeners_lastfm']))
merged_artists['scrobbles_lastfm'] = normalize(np.log1p(merged_artists['scrobbles_lastfm']))
merged_artists['listeners_lastfm'] = merged_artists['listeners_lastfm'].fillna(merged_artists['listeners_lastfm'].mean())
merged_artists['scrobbles_lastfm'] = merged_artists['scrobbles_lastfm'].fillna(merged_artists['scrobbles_lastfm'].mean())
# Scatter Plot
plt.figure(figsize=(12, 6))
plt.scatter(merged_artists['listeners_lastfm'], merged_artists['scrobbles_lastfm'], alpha=0.5, s=1)
plt.title('Scatter Plot')
plt.xlabel('Normalized Listeners_lastfm')
plt.ylabel('Normalized Scrobbles_lastfm')
plt.show()
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
properties = ['danceability', 'energy', 'loudness', 'speechiness', 'instrumentalness', 'liveness', 'valence', 'tempo']
# Group by genre and calculate mean for each property
grouped = musicInfo.groupby('genre')[properties].mean()
min_counts = 50
genres_to_include = musicInfo['genre'].value_counts()
genres_to_include = genres_to_include[genres_to_include > min_counts].index
grouped = grouped.loc[grouped.index.isin(genres_to_include)]
fixed_order = sorted(grouped.index)
colors = sns.color_palette("tab10", len(properties))
for idx, prop in enumerate(properties):
    plt.figure(figsize=(15,7))
    grouped.loc[fixed_order][prop].plot(kind='bar', color=colors[idx])
    plt.title(f'Average {prop.capitalize()} by Genre')
    plt.ylabel(prop.capitalize())
    plt.xlabel('Genre')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Sum up playcount for each track
playcounts_summed = listeningHistory.groupby('track_id')['playcount'].sum().reset_index()
def normalize(series):
    return (series - series.min()) / (series.max() - series.min())
# Merge the two DataFrames
merged = pd.merge(musicInfo, playcounts_summed, on='track_id')
merged['log_playcount'] = np.log(merged['playcount'] + 1)
merged['log_playcount'] = normalize(merged['log_playcount'])
properties = ['danceability', 'energy', 'loudness', 'speechiness', 'instrumentalness', 'liveness', 'valence', 'tempo']
# Normalize the properties
for prop in properties:
    merged[prop] = normalize(merged[prop])
sns.set_style("whitegrid")
# Loop through the properties and create the plots
for prop in properties:
    # Hexbin plot; jointplot creates its own figure, so no plt.figure() call is needed here
    g = sns.jointplot(data=merged, x=prop, y='log_playcount', kind='hex', gridsize=50, cmap='viridis', marginal_kws=dict(kde=True))
    # Label the joint axes through the grid object rather than the current pyplot axes
    g.set_axis_labels(prop.capitalize(), 'Normalized Log Total Playcount')
    g.fig.suptitle(f'{prop.capitalize()} vs Normalized Log Total Playcount', y=1.02)
    # Histogram
    plt.figure(figsize=(10, 8))
    sns.histplot(data=merged, x='log_playcount', bins=30, kde=True, color='blue', label='Log Playcount')
    sns.histplot(data=merged, x=prop, bins=30, kde=True, color='red', label=prop.capitalize(), ax=plt.gca())
    plt.xlabel(f'Normalized Log Total Playcount and {prop.capitalize()}')
    plt.ylabel('Frequency')
    plt.title(f'Histogram of {prop.capitalize()} and Normalized Log Total Playcount')
    plt.legend()
    plt.tight_layout()
    # Show inside the loop so open figures do not accumulate
    plt.show()
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Join track metadata (musicInfo) with each artist's country from the artists table
merged_df = pd.merge(musicInfo, artists[['artist_mb', 'country_mb']],
left_on='artist', right_on='artist_mb', how='inner')
# Drop unnecessary columns
merged_df.drop(columns=['artist_mb'], inplace=True)
top_countries = merged_df['country_mb'].value_counts().head(20).index.tolist()
print(top_countries)
filtered_df = merged_df[merged_df['country_mb'].isin(top_countries)]
properties = ['danceability', 'energy', 'loudness', 'speechiness', 'instrumentalness', 'liveness', 'valence', 'tempo']
# Define the figure size and number of plots in the grid
fig, axes = plt.subplots(nrows=len(properties), figsize=(20, 8*len(properties)))
for i, prop in enumerate(properties):
    sns.violinplot(data=filtered_df, x='country_mb', y=prop, palette='rainbow', ax=axes[i])
    axes[i].set_title(f"Distribution of {prop.capitalize()} by Country")
    # Rotate the tick labels seaborn already assigned instead of overwriting them
    axes[i].tick_params(axis='x', labelrotation=45)
plt.tight_layout()
plt.show()
['United States', 'United Kingdom', 'Germany', 'Sweden', 'Canada', 'France', 'Finland', 'Australia', 'Norway', 'Jamaica', 'Ireland', 'Poland', 'Brazil', 'Netherlands', 'Denmark', 'Italy', 'Switzerland', 'Iceland', 'Japan', 'Belgium']
Conclusion
Throughout this analysis of music data from different eras and regions, we have gained insights into music trends, preferences, and the factors that may influence music popularity. By comparing play counts between the periods 1950-1980 and 1980-2020, we observed significant shifts in music preferences, with tracks from the digital age showing a surge in engagement that is plausibly linked to technological advancements.
The evolution of music genres across different eras revealed dynamic changes in genre popularity, reflecting broader cultural shifts and technological influences. Listeners' genre preferences also demonstrated shifts over time, shaped by key events and advancements in the music landscape.
Notably, certain non-rock genres experienced substantial growth in play counts from 2000 to 2015, providing insights into the changing landscape of music popularity. Furthermore, the hexbin plots of audio features against play counts suggested potential relationships, with attributes like danceability and energy appearing to be associated with higher play counts.
Examining the relationship between listeners and scrobbles on Last.fm emphasized the importance of audience engagement to an artist's success. Factors such as the total amount of music released, the count of releases by year, and music genre attributes all appear to play a role in artist popularity and playcount.
Our study also highlighted the significant influence of cultural context on music attributes found in popular tracks. Genre preferences, thematic elements, and stylistic choices were shown to be influenced by specific cultural backgrounds, contributing to music popularity within distinct regions.
In conclusion, this thorough analysis offers valuable insights for music platforms, artists, and stakeholders to optimize content, engage with audiences effectively, and make informed decisions in the ever-evolving music industry. By understanding these patterns and trends, the music industry can continue to thrive and adapt to the changing preferences of global audiences.
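The relationship between audio features and play counts discussed in this conclusion could be quantified with a rank correlation. The sketch below is not part of the original analysis: it uses made-up toy values, and the names `toy` and `spearman` are hypothetical. Spearman correlation is chosen here as an assumption because play counts are heavily skewed and a rank-based measure does not require a linear relationship.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the merged track/playcount table (hypothetical values)
toy = pd.DataFrame({
    "danceability": [0.1, 0.2, 0.3, 0.4, 0.5],
    "energy": [0.9, 0.7, 0.5, 0.3, 0.1],
    "playcount": [1, 10, 100, 1000, 10000],
})
toy["log_playcount"] = np.log1p(toy["playcount"])

# Spearman rank correlation of every feature with the log play count
spearman = (
    toy[["danceability", "energy", "log_playcount"]]
    .corr(method="spearman")["log_playcount"]
    .drop("log_playcount")
)
print(spearman)
```

A coefficient near +1 or -1 indicates a strong monotonic association, while values near 0 indicate little association; neither implies that a feature causes higher play counts.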